
    SINBAD Automation of Scientific Process: From Hidden Factor Analysis to Theory Synthesis

    Modern science is turning to progressively more complex and data-rich subjects, which challenges the existing methods of data analysis and interpretation. Consequently, there is a pressing need for the development of ever more powerful methods of extracting order from complex data and for automation of all steps of the scientific process. Virtual Scientist is a set of computational procedures that automate the method of inductive inference to derive a theory from observational data dominated by nonlinear regularities. The procedures utilize SINBAD, a novel computational method of nonlinear factor analysis based on the principle of maximization of mutual information among non-overlapping sources (Imax), yielding higher-order features of the data that reveal hidden causal factors controlling the observed phenomena. One major advantage of this approach is that it does not depend on a particular choice of learning algorithm for the computations. The procedures build a theory of the studied subject by finding inferentially useful hidden factors, learning interdependencies among its variables, reconstructing its functional organization, and describing it by a concise graph of inferential relations among its variables. The graph is a quantitative model of the studied subject, capable of performing elaborate deductive inferences and explaining behaviors of the observed variables by behaviors of other such variables and discovered hidden factors. The set of Virtual Scientist procedures is a powerful analytical and theory-building tool designed for research on complex scientific problems characterized by multivariate and nonlinear relations.
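    The Imax principle mentioned above can be illustrated with a minimal sketch: two modules see non-overlapping halves of the same input and adjust their weights to maximize the mutual information between their scalar outputs, which under a Gaussian assumption equals -0.5 ln(1 - rho^2) and so reduces to maximizing the squared output correlation. All names, the synthetic data, and the numerical gradient-ascent scheme below are illustrative assumptions, not the published SINBAD implementation.

```python
# Hedged sketch of Imax: maximize correlation (Gaussian-MI proxy) between
# projections of two non-overlapping input halves driven by a hidden factor.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a hidden factor h drives both halves of the observation.
n = 2000
h = rng.normal(size=n)
half1 = np.c_[h + 0.3 * rng.normal(size=n), rng.normal(size=n)]
half2 = np.c_[2.0 * h + 0.3 * rng.normal(size=n), rng.normal(size=n)]

w1 = rng.normal(size=2)
w2 = rng.normal(size=2)

def corr(a, b):
    a = a - a.mean(); b = b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

lr = 0.05
for step in range(500):
    rho = corr(half1 @ w1, half2 @ w2)
    # Numerical gradient of rho^2 w.r.t. each weight vector (slow but keeps
    # the sketch self-contained; 'other' is fixed at the step's start).
    for w, x, other in ((w1, half1, half2 @ w2), (w2, half2, half1 @ w1)):
        g = np.zeros_like(w)
        for i in range(len(w)):
            w[i] += 1e-4
            g[i] = (corr(x @ w, other) ** 2 - rho ** 2) / 1e-4
            w[i] -= 1e-4
        w += lr * g

rho = corr(half1 @ w1, half2 @ w2)
print("correlation:", rho, " Gaussian MI estimate:", -0.5 * np.log(1 - rho**2))
```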

    A Novel Partial Sequence Alignment Tool for Finding Large Deletions

    Finding large deletions in genome sequences has become increasingly useful in bioinformatics, such as in clinical research and diagnosis. Although there are a number of publicly available next-generation sequencing mapping and sequence alignment programs, these software packages do not correctly align fragments containing deletions larger than 1 kb. We present a fast alignment software package, BinaryPartialAlign, that can be used by wet lab scientists to find long structural variations in their experiments. For BinaryPartialAlign, we make use of the Smith-Waterman (SW) algorithm with a binary-search-based approach for alignment with large gaps that we call partial alignment. The BinaryPartialAlign implementation is compared with other straightforward applications of SW. Simulation results on mtDNA fragments demonstrate the effectiveness (runtime and accuracy) of the proposed method.
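    The general idea can be sketched as follows: a read spanning a large deletion is split into a prefix and a suffix that are aligned independently with SW, and the split point is searched for rather than scanned exhaustively. The sketch below assumes the combined score is roughly unimodal in the split index so a binary/ternary-style search applies; the actual BinaryPartialAlign strategy may differ, and the scoring parameters are illustrative.

```python
# Hedged sketch of "partial alignment": split a read around a large deletion
# and search for the split that maximizes prefix + suffix SW scores.
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score of a against b (score only, O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def partial_align(read, ref):
    """Search for the read split maximizing prefix + suffix SW scores."""
    f = lambda k: sw_score(read[:k], ref) + sw_score(read[k:], ref)
    lo, hi = 1, len(read) - 1
    while hi - lo > 2:                    # unimodality assumed
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            lo = m1 + 1
        else:
            hi = m2 - 1
    return max(range(lo, hi + 1), key=f)

# Toy example: the middle of the reference is "deleted" from the read.
ref = "ACGTACGTTTTTGGGGCCCCAAAATTTTACGTACGT"
read = ref[:10] + ref[28:]
print("split index:", partial_align(read, ref))
```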

    Spectral Clustering with Reverse Soft K-Nearest Neighbor Density Estimation

    Spectral Clustering (SC) is a kernel method for clustering data objects using eigenvectors derived from the data. One fundamental issue that causes poor cuts in SC is its sensitivity to outliers. Another fundamental problem is how to determine the kernel bandwidth from the data. In fact, these two problems are closely related: one cannot be solved before solving the other. The answer lies in robust and nonparametric estimators of the data density. We propose Reverse Soft K-Nearest Neighbor Density Estimation (RSKNN), which determines the density around a data sample, and thus the sample's potential (also termed weight or entropy), using the nearest-neighbor scatter properties of all the other samples, contrary to the common practice of using the nearest neighbors of the sample itself to determine its own density. The basic idea can be summarized as "every sample has k nearest neighbors, but not every point can be in many points' k-nearest neighborhoods". To demonstrate its use, we apply it to SC. Using RSKNN to estimate the density and to identify samples with high potential to be cluster centers helps in: 1) the spectral decomposition phase, improving the generalization of the spectral cut criterion; 2) robust calculation of the covariance matrix used in distance calculations; and 3) automatic determination of the kernel bandwidth.
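    A minimal sketch of the reverse soft k-NN idea as described above: instead of scoring a point by the distances to its own k nearest neighbors, each point casts soft votes onto its k nearest neighbors, and a point's density (potential) is the total vote mass it receives. The exp(-d^2/sigma^2) weighting, the bandwidth heuristic, and the parameter choices are assumptions, not necessarily the published RSKNN formulation.

```python
# Hedged sketch of reverse soft k-NN density: potential = votes received from
# other points' neighborhoods, so isolated outliers receive little mass.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rsknn_potential(X, k=10, sigma=None):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)          # column 0 is each point itself
    dist, idx = dist[:, 1:], idx[:, 1:]
    if sigma is None:
        sigma = np.median(dist)           # crude bandwidth heuristic
    votes = np.exp(-(dist / sigma) ** 2)
    potential = np.zeros(len(X))
    np.add.at(potential, idx, votes)      # reverse direction: votes received
    return potential

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2)),
          rng.uniform(-10, 18, (20, 2))]  # two clusters plus outliers
p = rsknn_potential(X)
print("mean potential, cluster points:", p[:400].mean())
print("mean potential, outliers:     ", p[400:].mean())
```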

    Miss One Out: A Cross-Validation Method Utilizing Induced Teacher Noise

    Leave-one-out (LOO) and its generalization, K-fold, are among the most well-known cross-validation methods; they divide the sample into a number of folds, each of which is, in turn, left out for testing while the other parts are used for training. In this study, as an extension of this idea, we propose a new cross-validation approach that we call miss-one-out (MOO), which mislabels the example(s) in each fold and keeps the fold in the training set, rather than leaving it out as LOO does. MOO then tests whether the trained classifier can correct the erroneous labels of the training samples. In principle, having only one fold deliberately labeled incorrectly should have only a small effect on a classifier trained on this bad fold along with the K - 1 good folds, and the classifier's ability to correct the labels can be utilized as a measure of its generalization. Experimental results on a number of benchmark datasets and three real bioinformatics datasets show that MOO can better estimate the test set accuracy of the classifier.
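    The procedure described above translates directly into a short sketch: for each fold, flip its labels, train on all folds including the corrupted one, and count how often the classifier recovers the true labels of the flipped fold. The label-flipping rule for multi-class data and the choice of classifier below are illustrative assumptions.

```python
# Hedged sketch of miss-one-out (MOO) cross-validation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)

def moo_score(X, y, make_clf, n_splits=10):
    n_classes = len(np.unique(y))
    corrected = 0
    for _, fold in KFold(n_splits, shuffle=True, random_state=0).split(X):
        y_noisy = y.copy()
        y_noisy[fold] = (y[fold] + 1) % n_classes   # deliberate mislabeling
        clf = make_clf().fit(X, y_noisy)            # corrupted fold stays in
        corrected += (clf.predict(X[fold]) == y[fold]).sum()
    return corrected / len(y)                       # fraction corrected

print("MOO estimate:", moo_score(X, y, lambda: LogisticRegression(max_iter=5000)))
```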

    Path-based Connectivity for Clustering Genome Sequences

    Clustering is an unsupervised data mining tool; in bioinformatics, clustering genome sequences is used to group related biological sequences when no additional supervision is available. Sequence clusters are often related to gene/protein families, which can shed some light on determining tertiary structures. Extracting such hidden and valuable structures from a data set of genome sequences can benefit from better clustering methods, such as the recently popular spectral clustering. In this study, we apply spectral clustering and its improved variations to the sequence clustering task in our efforts to develop a novel approach for improving it.
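    One common realization of path-based connectivity, sketched below, defines the distance between two points as the largest edge on the best path connecting them (the minimax path distance), which can be read off the minimum spanning tree. Whether the paper uses exactly this minimax formulation is an assumption.

```python
# Hedged sketch: minimax (path-based) distances from the minimum spanning tree.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()
    mst = np.maximum(mst, mst.T)                  # symmetrize
    n = len(X)
    out = np.zeros((n, n))
    for s in range(n):                            # DFS from each node,
        stack, seen = [(s, 0.0)], {s}             # tracking the max edge seen
        while stack:
            u, mx = stack.pop()
            out[s, u] = mx
            for v in np.nonzero(mst[u])[0]:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, max(mx, mst[u, v])))
    return out

X = np.r_[np.c_[np.linspace(0, 10, 50), np.zeros(50)],       # elongated chain
          np.c_[np.linspace(0, 10, 50), np.full(50, 5.0)]]   # second chain
P = minimax_distances(X)
print("within-chain minimax:", P[0, 49], " between chains:", P[0, 50])
```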

    Canonical Correlation Analysis for Multiview Semisupervised Feature Extraction

    Hotelling's Canonical Correlation Analysis (CCA) works with two sets of related variables, also called views, and its goal is to find linear projections of the views with maximal mutual correlation. CCA is most suitable for unsupervised feature extraction when two views are given, but it has also long been known that in supervised learning, when only a single view of the data is given, the supervision signal (class labels) can be given to CCA as the second view, in which case CCA simply reduces to Fisher's Linear Discriminant Analysis (LDA). However, it is unclear how to use this equivalence for extracting features from multiview data in a semisupervised setting (i.e., what modification to the CCA mechanism could incorporate the class labels along with the two views of the data when the labels of some samples are unknown). In this paper, a CCA-based method supplemented by the essence of LDA is proposed for semisupervised feature extraction from multiview data.
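    The CCA-LDA equivalence this builds on is easy to demonstrate: feeding one-hot class labels to CCA as the second view recovers discriminant directions comparable to LDA's. The sketch below illustrates only this building block, not the semisupervised extension proposed in the paper.

```python
# Hedged sketch: CCA with a one-hot label "view" approximately recovers LDA.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Y = np.eye(3)[y]                      # one-hot labels as the second view

Z_cca = CCA(n_components=2).fit(X, Y).transform(X)
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# The projections should be highly correlated component by component
# (up to sign and scale).
for i in range(2):
    r = np.corrcoef(Z_cca[:, i], Z_lda[:, i])[0, 1]
    print(f"|corr(CCA_{i}, LDA_{i})| = {abs(r):.3f}")
```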

    Improved Spiral Test Using Digitized Graphics Tablet for Monitoring Parkinson’s Disease


    Improving Spectral Clustering Using Path-Based Connectivity

    Spectral clustering is a recently popular clustering method that is not limited to spherical-shaped clusters and is capable of finding elongated, arbitrarily shaped clusters. This graph-theoretical clustering method can use the Euclidean distance between each pair of examples, as well as connectivity-based similarity measures based on shortest paths or on paths that do not travel over examples separated by large distances on the graph. In this paper, a hybrid method is proposed that utilizes the distances used by both spectral and path-based spectral clustering algorithms. The proposed hybrid method is shown to be more robust than both methods.
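    One plausible way to hybridize the two distances mentioned above is sketched below: blend a Euclidean RBF affinity with a path-based (minimax) RBF affinity and feed the mixture to spectral clustering. The blending weight alpha and the minimax realization of "path-based" are assumptions, not the paper's exact construction; `minimax_distances` is the helper sketched earlier under "Path-based Connectivity for Clustering Genome Sequences".

```python
# Hedged sketch: hybrid Euclidean + path-based affinity for spectral clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def hybrid_spectral(X, n_clusters, alpha=0.5, sigma=1.0):
    D_euc = squareform(pdist(X))
    D_path = minimax_distances(X)              # from the earlier sketch
    W = alpha * np.exp(-(D_euc / sigma) ** 2) \
        + (1 - alpha) * np.exp(-(D_path / sigma) ** 2)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(W)

# Two-moons toy data, where elongated shapes defeat purely spherical methods.
rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 150)
moon1 = np.c_[np.cos(t), np.sin(t)] + rng.normal(0, 0.05, (150, 2))
moon2 = np.c_[1 - np.cos(t), 0.5 - np.sin(t)] + rng.normal(0, 0.05, (150, 2))
labels = hybrid_spectral(np.r_[moon1, moon2], n_clusters=2)
print("cluster sizes:", np.bincount(labels))
```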

    A method for combining mutual information and canonical correlation analysis: Predictive Mutual Information and its use in feature selection

    Feature selection is a critical step in many artificial intelligence and pattern recognition problems. Shannon's Mutual Information (MI) is a classical and widely used measure of dependence that serves as a good basis for feature selection. However, because it measures dependence on average, under-sampled classes (rare events) can be overlooked by this measure, which can cause critical false negatives (missing a relevant feature that is highly predictive of some rare but important classes). Shannon's mutual information requires a well-sampled database, which is not typical of many fields of modern science (such as biomedicine), in which there is a limited number of samples to learn from, or at least not all the classes of the target function (such as certain phenotypes in biomedicine) are well sampled. On the other hand, Kernel Canonical Correlation Analysis (KCCA) is a nonlinear correlation measure used effectively to detect independence, but its use for feature selection or ranking is limited because its formulation is not intended to measure the amount of information (entropy) of the dependence. In this paper, we propose a hybrid measure of relevance, Predictive Mutual Information (PMI), based on MI, which also accounts for the predictability of signals from each other in its calculation, as in KCCA. We show that PMI has better feature detection capability than MI, especially in catching suspicious coincidences that are rare but potentially important, not only for experimental studies but also for building computational models. We demonstrate the usefulness of PMI, and its superiority over MI, on both toy and real datasets.
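    A sketch in the spirit of PMI follows: combine a per-feature MI score with a nonlinear-predictability score. The abstract does not give the PMI formula, so the combination below (MI multiplied by a centered kernel alignment term, used here as a plainly named stand-in for the KCCA correlation) is a hypothetical illustration only.

```python
# Hedged sketch of a PMI-like relevance score: MI x kernel-based predictability.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics.pairwise import rbf_kernel

def cka(K, L):
    """Centered kernel alignment between two kernel matrices."""
    H = np.eye(len(K)) - 1.0 / len(K)
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc) + 1e-12)

def pmi_like_scores(X, y):
    mi = mutual_info_classif(X, y, random_state=0)
    Y = np.eye(y.max() + 1)[y]
    L = Y @ Y.T                                   # linear kernel on one-hot labels
    pred = np.array([cka(rbf_kernel(X[:, [j]]), L) for j in range(X.shape[1])])
    return mi * pred                              # hypothetical combination

# Toy data with a rare class: one feature predicts it, the other is noise.
rng = np.random.default_rng(0)
y = (rng.random(300) < 0.1).astype(int)
x_rare = np.where(y == 1, 3.0, 0.0) + rng.normal(0, 1, 300)
x_noise = rng.normal(0, 1, 300)
X = np.c_[x_rare, x_noise]
print("scores [informative, noise]:", pmi_like_scores(X, y))
```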